Abstract. We prove that longest common prefix (LCP) information can be stored in much less space
than previously known. More precisely, we show that in the presence of the text and the suffix array,
o(n) additional bits are sufficient to answer LCP-queries asymptotically in the same time that is needed
to retrieve an entry from the suffix array. This yields the smallest compressed suffix tree with
sub-logarithmic navigation time.



1 Introduction 

Augmenting the suffix array [7,15] with an additional array holding the lengths of longest common
prefixes drastically increases its functionality [15,1,2]. Stored as a plain array, this so-called
LCP-array occupies n⌈log n⌉ bits for a text of length n. Sadakane [21] shows that less space is actually
needed, namely 2n + o(n) bits if the suffix array is available at lookup time (see Sect. 2.3 for more
details). Due to the 2n-bit term, this data structure is nevertheless incompressible, even for highly
regular texts. As a simple example, suppose the text consists of only a's. Then the suffix array can
be compressed to almost negligible space [20,9,3], while Sadakane's LCP-array cannot.

Text regularity is usually measured in k-th order empirical entropy H_k [16]. We have 0 ≤ H_k ≤ log σ
for a text on an alphabet of size σ, with H_k being "small" for compressible texts. In Sect. 3 of
this article, we prove that the LCP-array can be stored in O(n / log log n) or nH_k + o(n) bits
(depending on how the text itself is stored), while giving access to an arbitrary LCP-value in time
O(log^δ n) (arbitrary constant 0 < δ < 1). This should be compared to other compressed or sampled
variants of the LCP-array. We are aware of three such methods:

1. Russo et al. [19] achieve nH_k + o(n) space, but retrieval time at least O(log^{1+ε} n), hence
super-logarithmic (arbitrary constant 0 < ε < 1).

2. Fischer et al. [5] achieve nH_k(log(1/H_k) + O(1)) bits, with retrieval time O(log^β n), again for any
constant 0 < β < 1. Although the space vanishes if H_k does, it is worse than that of our data structure.

3. Kärkkäinen et al. [12, Lemma 3] also employ the idea of "sampling" the LCP-array, but achieve
only amortized time bounds. Allowing the same space as for our data structure (O(n / log log n) bits
on top of the suffix array and the text), one would have to choose q = log n log log n in their
scheme, yielding super-logarithmic O(log n log log n) amortized retrieval time.

Finally, in Sect. 4, we apply our new representation of the LCP-array to suffix trees. This yields
the first compressed suffix tree with O(nH_k) bits of space and sub-logarithmic navigation time for
almost all operations.



2 Definitions and Previous Results 

This section sketches some known data structures that we are going to make use of. Throughout
this article, we use the standard word-RAM model of computation, in which we have a computer
with word-width w, where log n = O(w). Fundamental arithmetic operations (addition, shifts,
multiplication, ...) on w-bit wide words can be computed in O(1) time.

Fig. 1. Illustration of the succinct representation of the LCP-array (for A = 10 4 8 2 5 9 3 7 1 6).



2.1 Rank and Select on Binary Strings 

Consider a bit-string S[1,n] of length n. We define the fundamental rank- and select-operations
on S as follows: rank_1(S,i) gives the number of 1's in the prefix S[1,i], and select_1(S,i) gives
the position of the i'th 1 in S, reading S from left to right (1 ≤ i ≤ n). Operations rank_0(S,i)
and select_0(S,i) are defined similarly for 0-bits. There are data structures of size
O(n log log n / log n) bits in addition to S that support O(1)-rank- and select-operations,
respectively [11,6]. For an easily accessible exposition of these techniques, we refer the reader
to the survey by Navarro and Mäkinen [18, Sect. 6.1].
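
To make the two operations concrete, the following minimal Python sketch implements them by plain scanning. It takes linear time per query and is not the constant-time structure cited above; the function names `rank` and `select`, the string representation of S, and the example bit-string are illustrative assumptions. Positions are 1-based, as in the definitions.

```python
def rank(bits: str, b: str, i: int) -> int:
    """Number of occurrences of bit b in the prefix bits[1..i] (1-based, inclusive)."""
    return bits[:i].count(b)


def select(bits: str, b: str, i: int) -> int:
    """1-based position of the i-th occurrence of bit b, or 0 if there is none."""
    seen = 0
    for pos, c in enumerate(bits, start=1):
        if c == b:
            seen += 1
            if seen == i:
                return pos
    return 0


S = "0010110"                 # hypothetical example bit-string
print(rank(S, "1", 5))        # number of 1's in "00101"
print(select(S, "1", 2))      # position of the second 1
```

Note that rank and select are inverse in the sense that rank(S, "1", select(S, "1", i)) = i whenever the i-th 1 exists.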



2.2 Suffix- and LCP-Arrays 

The suffix array [7,15] for a given text T of length n is an array A[1,n] of integers s.t.
T_{A[i]..n} < T_{A[i+1]..n} for all 1 ≤ i < n; i.e., A describes the lexicographic order of T's
suffixes by "enumerating" them from the lexicographically smallest to the largest. It follows from
this definition that A is a permutation of the numbers [1,n]. Take, for example, the string
T = CACAACCAC$. Then A = [10,4,8,2,5,9,3,7,1,6]. Note that the suffix array is actually a
"left-to-right" (i.e., alphabetical) enumeration of the leaves in the suffix tree [10, Part II] for T$.
As A stores n integers from the range [1,n], it takes n words (or n⌈log n⌉ bits) to store A in
uncompressed form. However, there are also different variants of compressed suffix arrays; see again
the survey by Navarro and Mäkinen [18] for an overview of this field. In all cases, the time to
access an arbitrary entry A[i] rises to ω(1); we denote this time by t_A. All current compressed
suffix arrays have t_A = Ω(log^ε n) in the worst case (arbitrary constant 0 < ε < 1), and there are
indeed ones that achieve this time [9].

In the same way as suffix arrays store the leaves of the corresponding suffix tree, the LCP-array
captures information on the heights of the internal nodes as follows. Array H[1,n] is defined such
that H[i] holds the length of the longest common prefix of the lexicographically (i−1)'st and i'th
smallest suffixes. In symbols, H[i] = max{k : T_{A[i−1]..A[i−1]+k−1} = T_{A[i]..A[i]+k−1}} for all
1 < i ≤ n, and H[1] = 0. For T = CACAACCAC$, H = [0,0,1,2,2,0,1,2,3,1]. Kasai et al. [13] gave an
algorithm to compute H in O(n) time, and Manzini [17] adapted this algorithm to work in-place.¹
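
The two arrays of the running example can be reproduced with a short Python sketch. The suffix array is built by naive sorting (not one of the efficient construction algorithms), and the LCP-array by a linear-time scan in text order in the style of Kasai et al.; the function names are illustrative assumptions, and the returned lists are 0-indexed while their entries are 1-based text positions.

```python
def suffix_array(text: str) -> list[int]:
    """Suffix array as 1-based starting positions, built by naive sorting."""
    n = len(text)
    return sorted(range(1, n + 1), key=lambda p: text[p - 1:])


def lcp_array(text: str, sa: list[int]) -> list[int]:
    """Kasai-style LCP computation; entry idx of the result is H[idx + 1] of the text."""
    n = len(text)
    rank_of = [0] * (n + 1)            # rank_of[p] = 0-based index of suffix p in sa
    for idx, p in enumerate(sa):
        rank_of[p] = idx
    H = [0] * n
    h = 0
    for p in range(1, n + 1):          # process the suffixes in text order
        idx = rank_of[p]
        if idx == 0:                   # smallest suffix: no predecessor
            h = 0
            continue
        q = sa[idx - 1]                # lexicographic predecessor of suffix p
        while p + h <= n and q + h <= n and text[p + h - 1] == text[q + h - 1]:
            h += 1
        H[idx] = h
        if h > 0:
            h -= 1                     # the next value in text order drops by at most 1
    return H


T = "CACAACCAC$"
A = suffix_array(T)
H = lcp_array(T, A)
print(A)
print(H)
```

The sorting step relies on `$` being smaller than the letters in Python's character order, which matches the convention used in the example.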



2.3 2n + o(n)-Bit Representation of the LCP-Array

¹ Mäkinen [14, Fig. 3] gives another algorithm to compute H almost in-place.

Let us now come to the description of the succinct representation of the LCP-array due to Sadakane
[21]. The key to his result is the fact that the LCP-values cannot decrease too much if listed in
the order of the inverse suffix array A^{-1} (defined by A^{-1}[i] = j iff A[j] = i), a fact first
proved by Kasai et al. [13, Thm. 1]:

Proposition 1. For all i > 1, H[A^{-1}[i]] ≥ H[A^{-1}[i−1]] − 1. □
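
As a sanity check on the running example, the following Python sketch lists the LCP-values in text order and verifies the inequality of Proposition 1. The arrays are copied from Sect. 2.2; the variable names `inv` and `text_order` are illustrative assumptions.

```python
A = [10, 4, 8, 2, 5, 9, 3, 7, 1, 6]    # suffix array of CACAACCAC$ (1-based values)
H = [0, 0, 1, 2, 2, 0, 1, 2, 3, 1]     # LCP-array from Sect. 2.2

n = len(A)
inv = [0] * (n + 1)                     # inv[p] = 1-based rank of the suffix starting at p
for i, p in enumerate(A, start=1):
    inv[p] = i

# LCP-values listed in text order: H[A^{-1}[1]], ..., H[A^{-1}[n]]
text_order = [H[inv[p] - 1] for p in range(1, n + 1)]
print(text_order)

# Proposition 1: consecutive values in text order drop by at most one
holds = all(text_order[p] >= text_order[p - 1] - 1 for p in range(1, n))
print(holds)
```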

Because H[A^{-1}[i]] ≤ n − i + 1 (the LCP-value cannot be longer than the length of the suffix!), this
implies that H[A^{-1}[1]] + 1, H[A^{-1}[2]] + 2, ..., H[A^{-1}[n]] + n is a non-decreasing sequence of
integers in the range [1,n]. Now this list can be encoded differentially: for all i = 1, 2, ..., n,
subsequently write the difference I[i] := H[A^{-1}[i]] − H[A^{-1}[i−1]] + 1 of neighboring elements in
unary code 0^{I[i]}1 into a bit-vector S, where we assume H[A^{-1}[0]] = 0. Here, 0^x denotes the
juxtaposition of x zeros. See also Fig. 1. Combining this with the fact that the LCP-values are all
less than n, it is obvious that there are at most n zeros and exactly n ones in S. Further, if we
prepare S for constant-time rank_0- and select_1-queries, we can retrieve an entry from H by

H[i] = rank_0(S, select_1(S, A[i])) − A[i] . (1)

This is because the select_1-statement gives the position in S where the encoding for
H[i] = H[A^{-1}[A[i]]] ends, and the rank_0-statement counts the sum of the I[j]'s for 1 ≤ j ≤ A[i].
So subtracting the value A[i], which has been "artificially" added to the LCP-array, yields the
correct value. See Fig. 1 for an example. Because rank_0(S, select_1(S, x)) = select_1(S, x) − x, we
can rewrite (1) to

H[i] = select_1(S, A[i]) − 2A[i] , (2)

such that only one select-call has to be invoked.
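
The encoding and the lookup via (2) can be traced on the running example with a short Python sketch. It is a minimal illustration under stated assumptions: S is held as a Python string, `select1` is a linear scan rather than a constant-time structure, and the helper names `build_S`, `select1` and `lcp` are not from the original.

```python
def build_S(A: list[int], H: list[int]) -> str:
    """Unary-code the differences I[p] of the LCP-values taken in text order."""
    n = len(A)
    inv = [0] * (n + 1)                      # inv[p] = 1-based rank of suffix p
    for i, p in enumerate(A, start=1):
        inv[p] = i
    bits = []
    prev = 0                                 # plays the role of H[A^{-1}[0]] = 0
    for p in range(1, n + 1):
        cur = H[inv[p] - 1]
        diff = cur - prev + 1                # I[p], non-negative by Proposition 1
        bits.append("0" * diff + "1")
        prev = cur
    return "".join(bits)


def select1(S: str, k: int) -> int:
    """1-based position of the k-th 1 in S (linear scan, for illustration only)."""
    pos = -1
    for _ in range(k):
        pos = S.index("1", pos + 1)
    return pos + 1


def lcp(S: str, A: list[int], i: int) -> int:
    """H[i] via Eq. (2); i is 1-based."""
    j = A[i - 1]
    return select1(S, j) - 2 * j


A = [10, 4, 8, 2, 5, 9, 3, 7, 1, 6]
H = [0, 0, 1, 2, 2, 0, 1, 2, 3, 1]
S = build_S(A, H)
print(S)
print([lcp(S, A, i) for i in range(1, len(A) + 1)])
```

For this example S has 20 bits, with exactly 10 ones and 10 zeros, in line with the bound of 2n bits.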
This leads to

Proposition 2 (Succinct representation of LCP-arrays). The LCP-array for a text of length
n can be stored in 2n + O(n log log n / log n) bits in addition to the suffix array, while being able
to access its elements in time O(t_A). □



3 Less Space, Same Time 



The solution from Sect. 2.3 is admittedly elegant, but certainly leaves room for further
improvements. Because the bit-vector S stores the LCP-values in text order, we first have to convert the
position in A to the corresponding position in the text. Hence, the lookup time to H is dominated
by t_A, the time needed to retrieve an element from the compressed suffix array. Intuitively,
this means that we could take up to O(t_A) time to answer the select-query, without slowing down
the whole lookup asymptotically. Although this is not exactly what we do, keeping this idea in
mind is helpful for the proof of the following theorem.

Theorem 1. Let T be a text of length n with O(1)-access to its characters. Then the LCP-array
for T can be stored in O(n / log log n) = o(n) bits in addition to T and to the suffix array, while
being able to access its elements in O(t_A + log^δ n) time (arbitrary constant 0 < δ < 1).



Proof. We build on the solution from Sect. 2.3. Let j = A[i]. From (2), we compute H[i] as
select_1(S, j) − 2j. Computing A[i] takes time t_A. Thus, if we could answer the select-query in the
same time (using O(n / log log n) additional bits), we would be done. We now describe a data
structure that achieves essentially this. Our description follows in most parts the solution due to
Navarro and Mäkinen [18, Sect. 6.1], except that it does not store sequence S and the lookup-table on
the deepest level.

Fig. 2. Illustration of the proof of Thm. 1. We know from the value of a that the first m =
max(a − 2j, 0) characters of T_{j..n} and T_{j'..n} match (here with common prefix α). At most
s = log^δ n further characters will match (here β), until reaching a pair of mismatching characters.

We divide the range of arguments for select_1 into subranges of size κ = ⌊log² n⌋, and store in
N[i] the answer to select_1(S, iκ). This table N[1, ⌈n/κ⌉] needs O((n/κ) log n) = O(n / log n) bits,
and divides S into blocks of different size, each containing κ 1's (apart from the last).

A block is called long if it spans more than κ² = Θ(log⁴ n) positions in S, and short otherwise.
For the long blocks, we store the answers to all select_1-queries explicitly in a table P. Because
there are at most n/κ² long blocks, P requires O(n/κ² × κ × log n) = O(n / log⁴ n × log² n × log n) =
O(n / log n) bits.

Short blocks contain κ 1-bits and span at most κ² positions in S. We divide again their range
of arguments into sub-ranges of size λ = ⌊log² κ⌋ = Θ(log² log n). In N'[i], we store the answer
to select_1(S, iλ), this time only relative to the beginning of the block where i occurs. Because the
values in N' are in the range [1, κ²], table N'[1, ⌈n/λ⌉] needs O((n/λ) × log κ) = O(n / log log n)
bits. Table N' divides the blocks into miniblocks, each containing λ 1-bits.

Miniblocks are called long if they span more than s = log^δ n bits, and short otherwise. For
long miniblocks, we store again the answers to all select-queries explicitly in a table P', relative to
the beginning of the corresponding block. Because the miniblocks are contained in short blocks of
length ≤ κ², the answer to such a select-query takes O(log κ) bits of space. Thus, the total space for
P' is O(n/s × λ × log κ) = O(n log³ log n / log^δ n) bits. This concludes the description of our data
structure for select.
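
For reference, the space of the four tables can be collected in one place. The block below only restates the arithmetic of the preceding paragraphs, writing κ = ⌊log² n⌋ for the block parameter, λ = ⌊log² κ⌋ for the miniblock parameter and s = log^δ n for the miniblock length threshold.

```latex
\begin{align*}
|N|  &= O\!\left(\tfrac{n}{\kappa}\,\log n\right)
      = O\!\left(\tfrac{n}{\log n}\right), \\
|P|  &= O\!\left(\tfrac{n}{\kappa^{2}}\cdot\kappa\cdot\log n\right)
      = O\!\left(\tfrac{n}{\log n}\right), \\
|N'| &= O\!\left(\tfrac{n}{\lambda}\,\log\kappa\right)
      = O\!\left(\tfrac{n}{\log\log n}\right), \\
|P'| &= O\!\left(\tfrac{n}{s}\cdot\lambda\cdot\log\kappa\right)
      = O\!\left(\tfrac{n\,\log^{3}\log n}{\log^{\delta} n}\right).
\end{align*}
```

The term for N' dominates the other three, which gives the total of O(n / log log n) = o(n) bits.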

To answer a query select_1(S, j), let a = select_1(S, ⌊j/λ⌋λ) be the beginning of j's mini-block in
S. Likewise, compute the beginning of the next mini-block as b = select_1(S, ⌊j/λ⌋λ + λ). Now if
b − a > s, then the mini-block where j occurs is long, and we can look up the answer using our
precomputed tables N, N', P, and P'. Otherwise, we return the value a as an approximation to
the actual value of select_1(S, j).

We now use the text T to compute H[i] in additional O(log^δ n) time. To this end, let j' = A[i−1]
(see also Fig. 2). The unknown value H[i] equals the length of the longest common prefix of suffixes
T_{j..n} and T_{j'..n}, so we need to compare these suffixes. However, we do not have to compare letters
from scratch, because we already know that the first m = max(a − 2j, 0) characters of these suffixes
match. So we start the comparison at T_{j+m} and T_{j'+m}, and compare as long as they match. Because
b − a ≤ s = log^δ n, we will reach a mismatch after at most s character comparisons. Hence, the
additional time for the character comparisons is O(log^δ n). □
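
The idea of the proof can be sketched in Python as follows. This is a deliberately simplified illustration, not the data structure of the proof: it uses a single level of sampling with absolute positions instead of the two-level tables N, N', P and P', and the toy parameters `LAM` and `S_MAX` merely stand in for λ and s. All function and variable names are assumptions made for this sketch.

```python
T = "CACAACCAC$"
A = [10, 4, 8, 2, 5, 9, 3, 7, 1, 6]      # suffix array from Sect. 2.2 (1-based values)
H = [0, 0, 1, 2, 2, 0, 1, 2, 3, 1]       # LCP-array, used only to build the index
LAM, S_MAX = 3, 5                         # toy stand-ins for lambda and s


def build_index(A, H, lam, s_max):
    """Return (samples, explicit); the bit-vector S itself is never stored."""
    n = len(A)
    inv = {p: i for i, p in enumerate(A)}            # text position -> 0-based rank
    ones, pos, prev = [0], 0, 0                      # ones[j] = select_1(S, j); ones[0] = 0
    for p in range(1, n + 1):
        cur = H[inv[p]]
        pos += (cur - prev + 1) + 1                  # I[p] zeros followed by a single 1
        ones.append(pos)
        prev = cur
    samples = [ones[t] for t in range(0, n + 1, lam)]
    samples.append(ones[n])                          # sentinel: end of the last miniblock
    explicit = {}
    for t in range(len(samples) - 1):
        if samples[t + 1] - samples[t] > s_max:      # long miniblock: store exact answers
            for j in range(max(t * lam, 1), min(t * lam + lam, n + 1)):
                explicit[j] = ones[j]
    return samples, explicit


def lcp(i, T, A, samples, explicit, lam):
    """H[i] for 1-based i; the text finishes what the sampled select leaves open."""
    j = A[i - 1]
    if j in explicit:                                # long miniblock: exact select, Eq. (2)
        return explicit[j] - 2 * j
    if i == 1:
        return 0                                     # H[1] = 0 by definition
    a = samples[j // lam]                            # start of j's miniblock in S
    m = max(a - 2 * j, 0)                            # characters already known to match
    jp = A[i - 2]                                    # j' = A[i-1]
    n = len(T)
    while j + m <= n and jp + m <= n and T[j + m - 1] == T[jp + m - 1]:
        m += 1                                       # at most s_max further matches
    return m


samples, explicit = build_index(A, H, LAM, S_MAX)
print([lcp(i, T, A, samples, explicit, LAM) for i in range(1, len(A) + 1)])
```

With these toy parameters the first two miniblocks are long and answered from the explicit table, while the remaining entries are completed by comparing text characters.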

If the text is not available for O(1) access, we have two options. First, we can always compress
it with recent methods [22,4,8] to nH_k + o(n) space (which is within the space of all compressed
suffix arrays), while still guaranteeing O(1) random access to its characters:

Corollary 1. The LCP-array for a text of length n can be stored in nH_k + o(n) bits in addition to
the suffix array (simultaneously over all k ∈ o(log_σ n) for alphabet size σ), while
being able to access its elements in time O(t_A + log^δ n) (arbitrary constant 0 < δ < 1). □

The second option is to use the compressed suffix array itself to retrieve characters in O(t_A)
time. This is either already provided by the compressed suffix array [20], or can be simulated [5].
This leads to

Corollary 2. Let A be a compressed suffix array for a text of length n with access time t_A =
O(log^ε n). Then the LCP-array can be stored in o(n) bits in addition to the suffix array, while being
able to access its elements in time O(log^{ε+δ} n) (arbitrary constant 0 < δ < 1). □

Note in particular that all known compressed suffix arrays have worst-case lookup time Ω(log^ε n)
at the very best, so the requirements on t_A in Cor. 2 are no restriction on its applicability. Further,
by choosing ε and δ such that ε + δ < 1, the time to access the LCP-values remains sub-logarithmic.



3.1 Improved Retrieval Time 

Additional time could be saved in Thm. 1 and Cor. 1 by noting that a chunk of log_σ n text characters
can be processed in O(1) time in the RAM-model for alphabet size σ. Hence, when comparing the
at most s = log^δ n characters from suffixes T_{j+m..n} and T_{j'+m..n} (end of the proof of Thm. 1),
this could be done by processing at most s / log_σ n = log σ · log^{δ−1} n such chunks. This is
especially interesting if the alphabet size is small; in particular, if σ = O(2^{log^{1−δ} n}), this
part of the retrieval time becomes constant.
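
The chunk-wise comparison can be sketched in Python as follows. The text is packed with a fixed number of bits per character, and two chunks are compared with a single XOR; the position of the highest set bit of the result locates the first mismatching character. This is only an illustration of the logic: Python integers have arbitrary precision, so the sketch does not model the O(1)-per-chunk cost of the word-RAM, and the chunk width `w` (standing in for log_σ n) as well as all function names are assumptions.

```python
def pack(text: str, alphabet: str) -> tuple[int, int]:
    """Pack the text into one integer, first character in the most significant bits."""
    bits = max(1, (len(alphabet) - 1).bit_length())    # bits per character
    code = {c: k for k, c in enumerate(alphabet)}
    value = 0
    for c in text:
        value = (value << bits) | code[c]
    return value, bits


def chunk(packed: int, n: int, bits: int, start: int, length: int) -> int:
    """Characters start..start+length-1 (1-based, inside the text) as one integer."""
    shift = (n - (start + length - 1)) * bits
    return (packed >> shift) & ((1 << (length * bits)) - 1)


def extend_match(packed, n, bits, j, jp, m, w):
    """Extend a known match of m characters between suffixes j and jp, w characters per step."""
    while True:
        length = min(w, n - max(j, jp) - m + 1)         # stay inside both suffixes
        if length <= 0:
            return m
        x = chunk(packed, n, bits, j + m, length) ^ chunk(packed, n, bits, jp + m, length)
        if x:                                           # mismatch inside this chunk
            return m + (length * bits - x.bit_length()) // bits
        m += length


T = "CACAACCAC$"
packed, bits = pack(T, "$AC")
print(extend_match(packed, len(T), bits, 1, 7, 0, 2))   # LCP of suffixes 1 and 7
```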

The same improvement is possible for Kärkkäinen et al.'s solution [12], resulting in
O(q log σ / log n) amortized retrieval time in their scheme.



4 A Small Entropy-Bounded Compressed Suffix Tree 

The data structure from Thm. 1 is particularly appealing in the context of compressed suffix
trees. Fischer et al. [5] give a compressed suffix tree that has sub-logarithmic time for almost
all navigational operations. It is based on the compressed suffix array due to Grossi et al. [9], a
compressed LCP-array, and data structures for range minimum- and previous/next smaller value-queries
(RMQ and PNSV). Its size is nH_k(2 log(1/H_k) + 1/ε + O(1)) + o(n) bits, where the "ugly"
nH_k(log(1/H_k) + O(1))-term comes from a compressed form of the LCP-array. If we replace this data
structure with our new representation, we get (using Cor. 2 for simplicity):

Theorem 2. A suffix tree can be stored in (1 + 1/ε)nH_k + o(n) bits such that all operations can be
computed in sub-logarithmic time (except level ancestor queries, which have an additional O(log n)
penalty).

Proof. The space can be split into (1 + 1/ε)nH_k + o(n) bits from the compressed suffix array [9],
additional o(n) bits from the LCP-array of Cor. 2, plus o(n) bits for the RMQ- and PNSV-queries.
The time bounds are obtained from the third column of Table 1 in [5]. □

Other trade-offs than those in Thm. 2 are possible, e.g., by taking different suffix arrays, or by
preferring the LCP-array from Cor. 1 over that of Cor. 2.